Abstract

We provide a review of the state of the art and the future of packet
processing and switching. The industry's response to the need for
wire-speed packet processing devices whose function can be rapidly
adapted to continuously changing standards and customer requirements
is the concept of special programmable network processors. We discuss
the prerequisites of processing tens to hundreds of millions of
packets per second and indicate ways to achieve scalability through
parallel packet processing. Tomorrow's switch fabrics, which will
provide node-internal connectivity between the input and output ports
of a router or switch, will have to sustain terabit-per-second
throughput. After reviewing fundamental switching concepts, we discuss
architectural and design issues that must be addressed to allow the
evolution of packet switch fabrics to terabit-per-second throughput
performance.

Introduction 

At the turn of the century, the world is facing a fascinating
phenomenon: the establishment of the Internet as a worldwide
communications medium for the entire spectrum of communication modes:
data, voice, and video, both real-time and non-real-time. The
Internet's growing popularity for entirely new applications in the
fields of e-business and entertainment as well as its growing use for
well-established applications such as telephony have resulted in a
spectacular annual growth factor (4-10) of traffic carried by the net.

The fact that the Internet is able to grow at this enormous pace is
(apart from economic factors) enabled by the multiplication of optical
transmission bandwidth made possible by wavelength-division
multiplexing (WDM) and commensurate progress in the packet forwarding
capability of the network nodes (routers and switches).

Figure 1 shows the anatomy of a modern router or switch with its main
functional units: line interfaces, which physically attach multiple
transmission systems to the node and provide framing functionality;
network processors, which provide the intelligence and processing
power to analyze packet headers, look up routing tables, classify
packets based on their destination and source addresses and other
control information and (often complex) rules, and provide queuing and
policing of packets; the switch fabric, which provides high-speed
(ideally nonblocking) interconnection of the node's packet processing
units; and the system processor, which performs control point
functions such as route computation and box and network management. In
this article we focus on the two critical functions involved in the
forwarding of packets: packet processing and box-internal switching.
We then describe the evolution of requirements and technical
solutions, and discuss techniques that promise to provide the
functionality, performance, scalability, and flexibility required in
tomorrow's routers and switches.

Since the introduction of optical fibers in transport networks, the
serial time-division multiplexing (TDM) transmission speed, that is,
the synchronous optical network/synchronous digital hierarchy
(SONET/SDH) line rate, has grown exponentially at a rate of about 30
percent/year to reach 40 Gb/s today (Fig. 2a). The speed increase is
primarily gated by the electronics of the transceivers, which suggests
that the data rates will level off in the not too distant future.
Despite amazing progress in high-speed semiconductor technologies, it
is difficult to imagine today that serial transmission rates of
commercial transmission systems will grow much higher than 100 Gb/s
because of the intrinsic complexity and resulting costs of the
transceiver electronics, and, most important, the availability of a
much cheaper alternative to increase fiber transmission utilization in
the form of WDM. Deployment of WDM transmission technology brought
about a radical change: suddenly, the overall transmission capacity of
fiber links grew at a rate of about 200 percent/year, already reaching
1.6 Tb/s (160 x 10 Gb/s or 40 x 40 Gb/s). WDM technology is deployed
pervasively in the core transport networks and is about to emerge in
metropolitan
networks. Although the WDM capacity trend is expected to continue for
some time, longer-term physical limits will cause saturation, probably
on the order of about 50 Tb/s.

The port speeds of switches and routers inevitably had to follow the
speed increase of serial transmission over fibers (Fig. 2b). This was
made possible by advances in complementary metal oxide semiconductor
(CMOS) technology combined with design optimizations in packet
processing and switching hardware. The proliferation of WDM
transmission systems has given rise to an interesting question: are
routers and switches going to deal with the multiplication of fiber
transmission capacity by a corresponding increase of port speeds? For
reasons that will become clear in the discussion below, we are
convinced that port speed will continue to increase in line with fiber
serial transmission rates, but that the necessary growth in
packet-forwarding capacity will come from the nodes' increase in size
rather than port speed (Fig. 2c).

Building bigger systems implies two things: distributing the packet
processing over more processing units, and making switches with many
more input and output ports than available in today's designs
(Fig. 2d). The technical challenges resulting from these new
requirements are discussed below.

Network Processors 

Today's network nodes typically employ application-specific integrated
circuits (ASICs) to achieve packet forwarding and classification
performance commensurate with the data rates of the attached links
(so-called wire-speed performance). Such standard ASICs have a typical
development cycle after specification of 12-18 months, and, although
fast in terms of processing and economical in terms of silicon area
and power consumption, are rarely flexible enough for rapid adaptation
to protocol or standards changes.

A new type of device promises to solve this problem. Instead of having
special ASICs designed for each switch, router, or WAN access device,
communications equipment manufacturers will implement the
performance-critical packet forwarding functions in software that
executes on special-purpose network processors (NPs). Thus,
manufacturers will be able to add, expand, or modify functions for
layer 3-7 packet processing by modifying the NP software instead of
making time-consuming and expensive hardware changes. Dataquest
predicts that the programmable communications processor market will
reach $1 billion by 2003, which explains the amazing investments
startup companies and established communications technology vendors
are making in the development of NP technology [1].

To illustrate the operation principle of an NP, Fig. 3 shows a generic
block diagram. Through the "to and from PHY/switch fabric" interface,
data of multiple physical interfaces or the switch fabric are
transferred to/from the NP. The bitstream processors receive the
serial stream of packet data and extract the information needed to
process the packet, such as the IP source/destination address, type of
service (TOS) bits, or TCP source/destination port numbers. The packet
is then written into the packet buffer memory. The extracted control
information is fed to the processor complex, which constitutes the
programmable unit of the NP. Under program control, the processor, if
needed, extracts additional information from the packet and submits
the relevant part to the search engine, which looks up the medium
access control (MAC) or IP address, classifies the packet, or does a
virtual circuit/path identifier (VCI/VPI) lookup if the packet is
recognized as an asynchronous transfer mode (ATM) cell using the
routing and bridging tables and appropriately designed hardware
assists. Based on the results returned, the processor instructs the
scheduler to determine the appropriate departure time of the packet.
Upon packet transmission through the bitstream processor, the
necessary modifications to the packet header are performed.

Figure 1. The anatomy of a switch or router.

Figure 4. A load-balancer-based packet dispatcher.

Taking a closer look at the three major building blocks (processor
complex, packet buffer memory, and lookup and classification engine),
we shall now discuss present and future requirements, and the
resulting design issues and implementation challenges.

The Processor Complex 
Internet backbone link rates will evolve in the foreseeable future
from today's OC-48 (2.5 Gb/s) to OC-192 (10 Gb/s) and even OC-768
(40 Gb/s). With a minimum packet size in the range of 40 bytes (TCP/IP
ACK or SYN packets), these rates translate into wire-speed forwarding
requirements of 6, 25, and 100 million packets/s.
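
The arithmetic behind these packet rates can be checked with a short
sketch. The nominal link rates are taken from the text; the 50-byte
wire footprint per packet (a 40-byte packet plus roughly 10 bytes of
framing overhead) is an assumption made here because it reproduces the
quoted 6, 25, and 100 million packets/s. The article does not state
which overhead it assumes.

```python
# Sketch: wire-speed packet rates for minimum-size packets.
# ASSUMPTION: each 40-byte packet occupies about 50 bytes on the wire.
# This value is hypothetical, picked because it matches the quoted rates.

LINK_RATES_BPS = {"OC-48": 2.5e9, "OC-192": 10e9, "OC-768": 40e9}  # nominal, from the text
BYTES_ON_WIRE = 50  # assumed, not from the article


def packets_per_second(link_rate_bps: float, bytes_on_wire: int = BYTES_ON_WIRE) -> float:
    """Return how many packets of the given wire size fit into one second."""
    return link_rate_bps / (bytes_on_wire * 8)


if __name__ == "__main__":
    for name, rate in LINK_RATES_BPS.items():
        print(f"{name}: {packets_per_second(rate) / 1e6:.2f} million packets/s")
```

With exactly 40 bytes per packet and no overhead the same function
gives about 7.8, 31, and 125 million packets/s, so the quoted values
are the more conservative ones.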

Application benchmarks on general-purpose CPUs arrive at 2-3
instructions/packet byte for a single routing table lookup [2]. Taking
into account that simple layer 2/3 forwarding operations require
multiple lookups per packet (MAC address resolution, MAC address
learning, IP lookup), a minimum CPU performance of 2.5, 10, or 40
billion instructions/s for handling OC-48, OC-192, or OC-768,
respectively, will be required. More advanced networking functions
such as virtual private networks (VPN) and quality of service (QoS)
typically use encryption, data compression, and packet classification,
which require one to two orders of magnitude more processing power.

Today's NP products cover the performance range up to OC-48 and
provide the necessary processor performance with on-chip
multiprocessor clusters that employ optimized instruction sets and
dedicated hardware assists to offload performance-critical functions:
address lookup, classification, encryption/decryption, header checksum
calculation, and so on.



These hardware assists usually function as coprocessors, whereby
instruction calls are integrated as elementary machine instructions in
the instruction set architecture of the CPU. This way, complex
functions, which would require a substantial amount of native
processor instructions of the CPU, can be dispatched with a single
instruction and executed concurrently. A prerequisite for the actual
increase in packet throughput is that the CPU does not idle while
waiting for a coprocessor to complete its operation. Idling is avoided
by either providing multithreading capabilities in the processor via
hardware-assisted register-bank swapping to simultaneously work on
multiple packets, or by a sophisticated pipeline architecture of the
entire NP in which the offload functions complete their tasks prior to
using the respective results in the main processor. For both
solutions, it is important that the instruction set be open to
third-party software providers for code base and tool development.

Although today's high-end general-purpose processors can execute on
the order of 1 billion instructions/s, specialized instruction set
processors rarely achieve the system clock rates and architectural
features of their multi-issue superscalar or very long instruction
word (VLIW) counterparts. The hardware assists described above may
reduce the required amount of CPU instructions per packet by a factor
of 3 (for simple forwarding) to 100 (for more complex
encryption/decryption and data compression). However, this does not
suffice to bring the workload to a low enough level for a single
processor.

For these reasons, today's high-end NPs employ multiple (e.g., 16)
multithreaded processor cores clustered into one processor complex. A
key technical challenge is to ensure that the interprocessor
communication overhead (to preserve the packet sequence and
synchronization of data flow-related state information) does not
cancel the performance gain of parallel processing. A network node
that distorts the packet order may cause an excessive amount of
end-to-end retransmission and is therefore unacceptable. One way to
tackle this problem is
to use an ordering unit. On the arriving side, a dispatcher
dynamically assigns packets to a free processor. Once the processor
has finished processing the packet, it indicates this to the ordering
unit, and the packet is enqueued in the [...] buffers. The ordering
unit is in charge of maintaining packet sequence within a particular
packet flow and usually works with the [...] unit to enforce this. In
addition, as each processor is eligible to process packets of any
flow, state information must be kept in shared memory, and mechanisms
for serialization of data access and consistency must be provided.

An alternative scheme for solving the workload assignment problem is
shown in Fig. 4. In this scheme, the flow-preserving intelligence
resides in the dispatcher, which uses a fixed deterministic function
to assign flows to processors: all packets of one flow will be
assigned to the same processor. This simultaneously solves the packet
sequencing problem and serialization of access to pertinent
packet-flow data. In the example shown, a header preclassifier
extracts unique flow identification data (IP addresses, TCP port
numbers, etc.) from the packet header. A deterministic, static hash
function compresses the flow ID to a 13-bit index, which is used to
address a moderate-sized SRAM lookup table to map the flow to a
specific processor. When a packet pertaining to a new flow arrives
(i.e., the SRAM table entry is empty), it is assigned to the processor
with the currently lowest load, and the SRAM table entry is updated
accordingly. The actual load of the processor is estimated by
monitoring the queue lengths of buffers A, B, ..., N. Performance
evaluations of this approach using real Internet traffic flows have
shown that the queuing buffers, which absorb the difference between
aggregate packet throughput of the entire processor complex and the
individual processors' capacity, are drastically reduced in size
compared to round-robin or other non-feedback-based flow-to-processor
assignment algorithms. Hence, the queuing buffers can be implemented
on-chip, which enables extremely fast realization of the
load-balancing mechanism.
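
The dispatching steps just described (hash the flow ID to a 13-bit
index, consult a table, bind new flows to the least-loaded processor)
can be sketched in a few lines. This is an illustrative model only: the
class and method names are hypothetical, CRC-32 stands in for the
unspecified static hash function, and a Python list stands in for the
SRAM table. Aging out of table entries, which a real dispatcher would
presumably need, is not modeled.

```python
import zlib
from typing import List, Optional, Tuple

FlowId = Tuple[str, str, int, int]  # (src IP, dst IP, src port, dst port)
INDEX_BITS = 13                     # 13-bit index, as in the text
TABLE_SIZE = 1 << INDEX_BITS        # 8192 table entries


class FlowDispatcher:
    """Hypothetical model of a flow-preserving, load-balancing dispatcher."""

    def __init__(self, num_processors: int) -> None:
        self.table: List[Optional[int]] = [None] * TABLE_SIZE  # stands in for the SRAM
        self.queue_len = [0] * num_processors                  # per-processor buffer fill

    @staticmethod
    def flow_index(flow: FlowId) -> int:
        # ASSUMPTION: CRC-32 truncated to 13 bits; the article names no hash.
        key = "|".join(map(str, flow)).encode()
        return zlib.crc32(key) & (TABLE_SIZE - 1)

    def dispatch(self, flow: FlowId) -> int:
        """Return the processor for this packet, binding new flows on first use."""
        idx = self.flow_index(flow)
        proc = self.table[idx]
        if proc is None:
            # Empty entry: pick the processor with the shortest queue right now.
            proc = min(range(len(self.queue_len)), key=self.queue_len.__getitem__)
            self.table[idx] = proc
        self.queue_len[proc] += 1
        return proc

    def complete(self, proc: int) -> None:
        """Called when a processor finishes a packet and its queue shrinks."""
        self.queue_len[proc] -= 1
```

Because the table is indexed by the hash rather than by the full flow
ID, two flows that collide on the same index share a processor, which
still preserves packet order within each flow.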

Packet Buffer Memory 
As in many other communication subsystems, memory access bandwidth to
the external DRAM-based packet data repository is the scarcest
resource in NPs. For this reason, the NP's architecture must be
designed very carefully to avoid unnecessary data transfer across this
memory interface.

In an NP architecture as depicted in Fig. 3, each packet byte may
traverse the memory interface up to four times when
encryption/decryption or deep packet parsing functions are performed.
This is also the case for short packets such as TCP/IP
acknowledgments, where the packet header is the entire packet:

• Write packet to data store on inbound

• Read header (= packet) into processor complex

• Write back to memory

• Read for outbound transmission

This means that for small packets, which typically represent 40
percent of all Internet packets, the required memory interface
capacities amount to 10, 40, or 120 Gb/s for OC-48, OC-192, or OC-768,
respectively. Even the lowest of these values, 10 Gb/s, exceeds the
access rate of today's commercial DRAMs. Complex memory-interleaving
techniques that pipeline memory access and distribute individual
packets over multiple parallel DRAM chips can be applied for 10 Gb/s
and possibly 40 Gb/s memory subsystems. At 120 Gb/s, today's 166 MHz
DDR SDRAMs would require well over 360-bit-wide memory interfaces, or
typically some 25 DDR SDRAM chips. This suggests that at OC-768
speeds, only on-chip, ultra-wide DRAM technology can support the
required memory access rates.
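
The interface width quoted for 120 Gb/s follows from the peak transfer
rate of a double-data-rate pin, as the sketch below shows. The 16-bit
data width per chip is an assumption introduced for illustration and
does not come from the article. The calculation uses peak rates only
and ignores refresh, bank conflicts, and bus turnaround, all of which
push the real requirement higher.

```python
import math


def ddr_bus_width_bits(required_bps: float, clock_hz: float) -> float:
    """Data-bus width needed when every pin moves two bits per clock (DDR)."""
    per_pin_bps = 2 * clock_hz
    return required_bps / per_pin_bps


def chips_needed(bus_width_bits: float, chip_width_bits: int = 16) -> int:
    """Number of memory chips, assuming a hypothetical 16-bit-wide part."""
    return math.ceil(bus_width_bits / chip_width_bits)


if __name__ == "__main__":
    width = ddr_bus_width_bits(120e9, 166e6)
    print(f"bus width: {width:.0f} bits, chips: {chips_needed(width)}")
```

At peak rate this gives a bus a little over 360 bits wide and, with
the assumed 16-bit parts, 23 chips as a lower bound, the same order as
the "some 25" chips quoted above.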

The Lookup and Classification Engine 
In the mid-1990s, forwarding tables of Internet backbone routers
contained some 10,000 entries; today they exceed 85,000, and
expectations are that routing tables with up to 500,000 entries may be
required in fewer than five years [3]. Emerging security and
class-of-service requirements, with their need for packet
classification, add a new dimension to the packet forwarding problem:
in order to find and apply the appropriate rule from the
classification rule base, multiple searches or lookups per packet are
required. The challenge is to combine high forwarding and
classification performance with low memory usage of classification and
forwarding tables. The dynamic policy-based networking of tomorrow's
Internet will require a highly dynamic classification rule base that
supports table update frequencies on the order of hundreds of updates
per second.

Over the past few years, significant progress has been made in the
development of forwarding algorithms and implementations. Most
techniques, however, address only a subset of the above parameters
(speed, size, and update performance) [4]. Only recently have
algorithms been proposed that address the classification and
forwarding problem in a generic way based on worst-case assumptions
and no longer rely on special properties of the forwarding and
classification tables and rules [5]. Related work at IBM Research [6]
has concentrated on finding approaches that introduce pipelining by
segregation of the forwarding key in suitable bit fields and then
dynamically deciding whether the longest-prefix matched (LPM) lookup
of a field is done as a bit test or a table lookup, depending on
whether the forwarding table is locally sparsely filled. This approach
tends to hold the information on a specific prefix localized and not
compressed, allowing fast updates. Lookup times equal a single memory
access cycle and the table size scales better than O(P), with P being
the number of prefixes in the forwarding table. Classification can be
decomposed in a similar fashion, allowing parallel range searches. The
results of the range searches are combined into a variable-sized
prefix, which can be resolved through the above LPM lookup to yield
the final classification result.
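
For readers unfamiliar with longest-prefix matching, the following
sketch shows what an LPM lookup has to return. It uses a plain binary
trie, a textbook technique swapped in here for illustration. It is not
the pipelined bit-field scheme from IBM Research described above, and
it needs up to 32 node visits per IPv4 lookup rather than a single
memory access. The class names and next-hop labels are hypothetical.

```python
import ipaddress
from typing import Dict, Optional


class _Node:
    """One trie node; next_hop is set only where a prefix ends."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}
        self.next_hop: Optional[str] = None


class PrefixTrie:
    """Binary trie for IPv4 longest-prefix matching (illustrative only)."""

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1  # walk from the most significant bit
            node = node.children.setdefault(bit, _Node())
        node.next_hop = next_hop

    def lookup(self, address: str) -> Optional[str]:
        addr = int(ipaddress.ip_address(address))
        node = self.root
        best = node.next_hop  # a default route (/0) lives at the root
        for i in range(32):
            node = node.children.get((addr >> (31 - i)) & 1)
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop  # remember the longest match seen so far
        return best
```

Inserting 10.0.0.0/8 and 10.1.0.0/16 and then looking up 10.1.2.3
returns the next hop of the /16 entry, because it is the longer of the
two matching prefixes.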

Because of the inherent high degree of parallelism in these
approaches, it is expected that within the next few years, packet
classification and forwarding implemented in hardware will achieve
OC-768 link speeds for forwarding table sizes of 500,000 entries or
more, tens of thousands of classification rules, dynamically updatable
in sub-millisecond time, and all stored in on-chip DRAMs.

Outlook: Parallel Network Processors 

Although parallel processing techniques are employed inside a single
NP (e.g., in the processor complex), no true multi-NP approach, where
a set of NPs cooperate to give the appearance of a single
higher-performance NP, has been presented yet. Load balancing
mechanisms, such as the one described above in the context of packet
header dispatching, have the potential to become the key to an
efficient multi-NP solution. For example, in order to build an OC-768
NP solution out of lower-speed NPs, one could split the 40 Gb/s data
stream into multiple 10 Gb/s or lower-rate streams using the
flow-preserving technique described above. Such a solution would
dramatically reduce the memory interface problems discussed in the
"Packet Buffer Memory" section.

Moreover, this technique could be extended to provide another very
attractive feature: since the header preclassifier in the load
balancer is aware of the protocol characteristics of the data flows,
it may overrule the hash result and assign specific flows (e.g., IPsec
or ATM) to processors optimized for handling these flows. Thus, a
heterogeneous set of processors could be supported as an alternative
to having every processor implement all functions.

Switch Fabrics 

Switch fabrics serve to interconnect the various functional units of a
switch or router, in particular network and system processors
(Fig. 1). The two basic functions of a packet switch fabric are the
spatial transfer (switching) of packets from their incoming ports to
the destination ports and the buffering of packets to resolve
contention. In this section, following a brief review of switch
architecture fundamentals, we discuss recent architectural advances in
the form of virtual output queuing in combination with either pure
input queues or combined input/output queues, and use an
implementation of the latter as a case study to point out present and
future challenges in the realization of high-speed packet switch
fabrics. The section concludes with a look into the more distant
future of switching.

Classic Switch Architectures 
The two classic single-stage packet switch architectures are
characterized by the temporal order of queuing and switching functions
[7]. Queuing before switching is called input queuing (IQ) (Fig. 5a),
and switching before queuing is output queuing (OQ) (Fig. 5b). The two
architectures have different performance behavior. For uniform Poisson
traffic, OQ achieves 100 percent throughput with infinite FIFO output
buffers, whereas IQ is limited to 58 percent throughput due to the
head-of-line blocking phenomenon. For nonuniform or bursty traffic the
efficiency of IQ can be even worse. In both cases, finite buffers may
cause packet losses.
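
The 58 percent limit of FIFO input queuing corresponds to the
well-known saturation value 2 - sqrt(2), about 0.586, for large
switches under uniform traffic. A small Monte Carlo sketch makes the
head-of-line blocking effect visible. The function name, the random
tie-breaking among contending inputs, and the seed are choices made
for this illustration and are not taken from the article.

```python
import random


def hol_throughput(n_ports: int, n_slots: int, seed: int = 1) -> float:
    """Estimate saturation throughput of an N x N FIFO input-queued switch.

    Every input is always backlogged; only its head-of-line (HOL) packet
    may contend, and each output serves one contender per time slot.
    """
    rng = random.Random(seed)
    # Destination of the HOL packet at each input (uniform over outputs).
    head = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        contenders = {}
        for inp, out in enumerate(head):
            contenders.setdefault(out, []).append(inp)
        for inputs in contenders.values():
            winner = rng.choice(inputs)            # one packet per output and slot
            head[winner] = rng.randrange(n_ports)  # the next packet moves to the head
            delivered += 1
        # Losing inputs keep their HOL packet and block everything behind it.
    return delivered / (n_ports * n_slots)


if __name__ == "__main__":
    print(f"32 x 32 switch: {hol_throughput(32, 5000):.3f}")
```

For a 32 x 32 switch the estimate settles close to 0.59, while a 1 x 1
"switch" trivially reaches 1.0, which shows that the loss of
throughput comes entirely from contention among head-of-line packets.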



The attractiveness of IQ lies in its simplicity and low cost. However,
in the early days of fast packet switching, performance was the reason
why many switch designs adopted the OQ concept in spite of the more
complex and expensive multiport buffers required to enqueue multiple
simultaneously arriving packets destined for the same output port
(e.g., Bell Labs' Knockout, IBM's Prizma, NEC's Atom, or Siemens'
Sigma switch). Complexity and cost prohibit generously sized output
buffers; hence, packet losses remain an issue. To overcome this
problem, various improvements of the generic OQ concept have been
proposed. A first improvement is the shared queuing (SQ) concept [8]
(Fig. 5d), which reduces the loss probability from that of dedicated
output queues due to better utilization of the limited memory space
available on the VLSI chips. A second improvement of OQ or SQ is in
combination with IQ [8] (Fig. 5f), which eliminates loss in the output
queues by backpressure: that is, the input queues hold back packets if
they no longer fit into the output queues.

Virtual Output Queuing
It is well known that a more sophisticated queuing discipline can
avoid the IQ head-of-line blocking problem. What is needed is to
provide a separate queue per output at each input (i.e., a total of
N^2 input queues for an N x N switch), and an appropriate scheduling
algorithm for these queues that has global knowledge and hence must be
centralized. This concept is called virtual output queuing (VOQ)
(Fig. 5c), although the queuing physically occurs at the inputs.
Initially, VOQ did not receive much attention because of the N^2
complexity and the limited scalability of the centralized controller.
However, advances in CMOS technology and algorithmic improvements have
recently changed this. The N^2 input queues have become easier to
implement, and the scalability of the centralized controller has been
simplified to some degree by heuristic, suboptimal, though reasonably
well-performing scheduling schemes such as the iSLIP algorithm [9].
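
To give a feel for what such a centralized scheduler computes, the
sketch below models one request-grant-accept iteration in the style of
iSLIP, with round-robin grant and accept pointers that advance only
when a grant is accepted. It is a simplified software model written
for this illustration: the function name and data layout are
hypothetical, a real scheduler runs several iterations per cell time,
and it is implemented in hardware.

```python
from typing import Dict, List


def islip_iteration(voq: List[List[int]],
                    grant_ptr: List[int],
                    accept_ptr: List[int]) -> Dict[int, int]:
    """One request-grant-accept round; returns {input: matched output}.

    voq[i][j] is the number of cells queued at input i for output j.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    """
    n = len(voq)

    # Request + grant: each output grants the first requesting input
    # at or after its grant pointer.
    grants: Dict[int, List[int]] = {}  # input -> outputs that granted it
    for out in range(n):
        for k in range(n):
            inp = (grant_ptr[out] + k) % n
            if voq[inp][out] > 0:
                grants.setdefault(inp, []).append(out)
                break

    # Accept: each input accepts the first granting output at or after
    # its accept pointer; pointers move one beyond the matched partner.
    match: Dict[int, int] = {}
    for inp, outs in grants.items():
        for k in range(n):
            out = (accept_ptr[inp] + k) % n
            if out in outs:
                match[inp] = out
                accept_ptr[inp] = (out + 1) % n
                grant_ptr[out] = (inp + 1) % n
                break
    return match
```

On a fully backlogged 2 x 2 switch with all pointers at zero, the
first call matches only input 0 to output 0; the pointer update then
desynchronizes the outputs, so the second call matches both inputs.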

An alternative approach combines VOQ with SQ [10] (Fig. 5e). Here, the
shared buffer provides a repository for the heads of all input queues
and hence serves as a contention resolver. Consequently, only simple
decentralized schedulers are required at the input ports, a major
advantage of this technique. Compared to the combination of SQ with
simple IQ, the throughput performance of this concept is very robust
with respect to varying traffic characteristics because head-of-line
blocking is eliminated. In the next subsection we describe this
concept and key aspects of its implementation in more detail.

Figure 5. Switch architectures.

lossless, nonblocking, and self-routing building block, which
autonomously forwards fixed-size packets arriving at any of the N
input ports to one or more of the N output ports based on
switch-internal port addresses. Packets are physically stored in a
shared memory and enqueued by writing a pointer into a logical output
queue. Multicast is supported by storing one copy of the multicast
packet in the shared memory and replicating the pointer in the
necessary output queues. PRIZMA has built-in scaling modes (Fig. 6);
particularly important is the ability to multiply the port speed by
using multiple switch chips in parallel, which is called speed
expansion (Fig. 6a), and to accommodate more ports in port expansion
mode, by using multiple chips in either a particular single-stage
arrangement (Fig. 6b) or a multistage arrangement. Furthermore, speed
and port expansion can be combined (Fig. 6c).
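
The pointer-based queuing and multicast handling described above can
be modeled in software as follows. This is a minimal sketch, not the
chip's actual design: the class name is hypothetical, and the
per-location reference count used to decide when a memory location can
be reused is an assumption, since the text does not say how locations
are reclaimed. Returning False on a full memory is likewise a
simplification; in the lossless design, flow control would hold the
packet back at the input instead.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


class SharedMemorySwitch:
    """Toy model: one packet store, per-output queues holding only pointers."""

    def __init__(self, n_ports: int, n_cells: int) -> None:
        self.memory: List[Optional[bytes]] = [None] * n_cells
        self.refcount = [0] * n_cells                 # assumed reclamation scheme
        self.free: Deque[int] = deque(range(n_cells)) # free memory locations
        self.out_queues: List[Deque[int]] = [deque() for _ in range(n_ports)]

    def enqueue(self, cell: bytes, outputs: Iterable[int]) -> bool:
        """Store one copy of the cell and queue its address at every output."""
        outputs = list(outputs)
        if not self.free or not outputs:
            return False
        addr = self.free.popleft()
        self.memory[addr] = cell
        self.refcount[addr] = len(outputs)
        for out in outputs:
            self.out_queues[out].append(addr)  # multicast: replicate the pointer only
        return True

    def dequeue(self, out: int) -> Optional[bytes]:
        """Transmit the next cell on an output; free memory after the last copy."""
        if not self.out_queues[out]:
            return None
        addr = self.out_queues[out].popleft()
        cell = self.memory[addr]
        self.refcount[addr] -= 1
        if self.refcount[addr] == 0:
            self.memory[addr] = None
            self.free.append(addr)
        return cell
```

A multicast cell sent to two outputs occupies a single memory location
until both outputs have transmitted it, which is the memory saving the
pointer scheme is meant to provide.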

PRIZMA's packet length is configurable between 64 and 80 bytes to
accommodate 53-byte ATM cells as well as the minimum-size 64-byte
Ethernet packet. To provide QoS, each output port logically supports
four priority queues with transmission priority scheduling and a
guaranteed bandwidth mechanism.

For flow control purposes, each (logical) output queue has a
programmable threshold. A grant signal is broadcast to all input
adapters for each queue. It is active when the queue occupancy is
below its threshold, allowing the input adapter to transmit. Using the
grant information, each input adapter can locally implement VOQ in its
scheduler.

This architecture combines VOQ, SQ, and grant flow control. It also
features distributed, simple schedulers that achieve better
scalability than the centralized controller approach (Fig. 5c).
Consider, for example, a 32 x 32 port switch at OC-192 speed built in
the latter architecture. A 64-byte cell takes 51.4 ns to transmit.
During this cell time, the central controller has to schedule 32
queues/input port, a total of 1024 queues. A future 64 x 64 switch
with OC-768 port speed would need a single central scheduler that can
process 4096 queues in 12.9 ns! In contrast, each of the distributed
controllers in the combined VOQ/SQ architecture only has to deal with
32 queues in 51.4 ns for the 32 x 32 example with OC-192 ports. For
the 64 x 64 case with OC-768 ports, this grows to only 64 queues in
12.9 ns. Moreover, the VOQ/SQ controllers can employ simple
round-robin scheduling that can be implemented in one clock cycle,
whereas, even with the optimized iSLIP algorithm [9], the central
controller takes more than three clock cycles to converge.
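
The cell times and queue counts in this comparison can be reproduced
from the SONET line rates, where OC-n runs at n times 51.84 Mb/s. The
function names below are hypothetical helpers for this check.

```python
OC1_BPS = 51.84e6  # SONET OC-1 line rate; OC-n is n times this value


def cell_time_ns(cell_bytes: int, oc_level: int) -> float:
    """Time to transmit one cell at the given OC level, in nanoseconds."""
    return cell_bytes * 8 / (oc_level * OC1_BPS) * 1e9


def queues_central(n_ports: int) -> int:
    """A central VOQ scheduler sees one queue per (input, output) pair."""
    return n_ports * n_ports


def queues_distributed(n_ports: int) -> int:
    """Each input-side scheduler only sees its own queues, one per output."""
    return n_ports


if __name__ == "__main__":
    for ports, oc in ((32, 192), (64, 768)):
        print(f"{ports} x {ports} at OC-{oc}: "
              f"{cell_time_ns(64, oc):.1f} ns per cell, "
              f"{queues_central(ports)} queues (central) vs "
              f"{queues_distributed(ports)} (per input)")
```

The scheduling budget shrinks by a factor of four from OC-192 to
OC-768 while the central scheduler's queue count quadruples, whereas
each distributed scheduler only sees its queue count double.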

The most recent (second-generation) implementation of the PRIZMA
architecture consists of a 32-input and 32-output port switch chip, in
which each port runs at 2 Gb/s, achieving an aggregate throughput of
64 Gb/s. The chip is realized in IBM's 0.25-um CMOS technology. Two
chips can be operated in speed expansion mode, which doubles the port
speed to 4 Gb/s and the throughput to 128 Gb/s. Currently the
third-generation PRIZMA chip is being developed. Employing speed
expansion with this chip results in a single-stage fabric of 2 Tb/s;
using twofold port expansion in addition will yield 4 Tb/s.

The main implementation challenges of a PRIZMA-type
terabit-per-second switch fabric are chip-level wiring, moving
extremely high data rates on and off the switch chips, and power
dissipation. To illustrate these problems, let us consider a 32 x 32
shared-memory design with 1024 fixed packet memory locations and a
port speed of 16 Gb/s. Using 0.1 um CMOS technology, the clock cycle
would be 2 ns, resulting in a 32-bit-wide bus for each port-to-memory
connection. Each of the 32 input ports must reach any of the 1024
memory locations, which in turn must reach any of 32 output ports. In
front of each memory there is a 32-to-1 multiplexer, each input being
a 32-bit bus. If we were to draw a fictitious line in front of the
1024 memories, a total of 1024 x 32 x 32, or about 1 million, wires
would cross that line. Even when using an aggressive metal pitch and a
high number of metal layers, this fictitious line would have to be
30 cm long, which illustrates the magnitude of the wiring problem to
be solved.
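
The wire count follows directly from the stated parameters, as the
short calculation below shows. The helper names are hypothetical, and
the "effective pitch" printed at the end is a derived figure (line
length divided by wire count) that is not given in the article.

```python
def bus_width_bits(port_gbps: int, clock_period_ns: int) -> int:
    """Bits that must move per clock cycle: Gb/s times ns gives bits."""
    return port_gbps * clock_period_ns


def wires_crossing(memories: int, inputs: int, bus_bits: int) -> int:
    """Wires in front of the memories: one bus per input at every memory."""
    return memories * inputs * bus_bits


if __name__ == "__main__":
    bus = bus_width_bits(16, 2)             # 16 Gb/s port, 2 ns clock cycle
    wires = wires_crossing(1024, 32, bus)   # 1024 memories, 32 input ports
    line_um = 30 * 10_000                   # 30 cm expressed in micrometers
    print(f"bus width: {bus} bits")
    print(f"wires crossing the line: {wires}")
    print(f"effective pitch over all layers: {line_um / wires:.3f} um per wire")
```

Spreading 1,048,576 wires over 30 cm leaves less than 0.3 um per wire,
which is only conceivable because the wires are distributed over
several metal layers.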

The technical problems we face in implementing the required on- and
off-chip data rates are equally challenging. If, under the same
assumptions as above, we postulate, for chip-to-chip interconnect, a
2 Gb/s serial link technology employing differential signaling and
requiring 200 mW/link, eight serial interfaces per port would be
needed. Input and output ports counted together, this results in 1024
pins needed for data transfer alone, excluding clocking, power,
ground, and other I/Os! The chip power required would be 51 W just for
getting data on and off chip, without taking into account the power
needed for the chip's switching function. It is obvious that novel
technological and engineering approaches are required to overcome
these implementation problems.
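
The pin count can be reproduced step by step from the figures in the
text. The power figure needs one extra assumption, flagged in the
code: the 51 W quoted above is matched if only half of each link's
200 mW is dissipated on the switch chip (one end of the link). That
split is a guess made here to reconcile the numbers and is not stated
in the article.

```python
PORT_GBPS = 16        # port speed, from the text
LINK_GBPS = 2         # serial link speed, from the text
LINK_POWER_MW = 200   # power per link, from the text
PINS_PER_LINK = 2     # differential signaling uses a pin pair per link
ON_CHIP_SHARE = 0.5   # ASSUMPTION: half of the link power is on this chip


def links_per_port() -> int:
    return PORT_GBPS // LINK_GBPS


def total_links(inputs: int = 32, outputs: int = 32) -> int:
    return (inputs + outputs) * links_per_port()


def data_pins() -> int:
    return total_links() * PINS_PER_LINK


def io_power_watts(on_chip_share: float = ON_CHIP_SHARE) -> float:
    return total_links() * LINK_POWER_MW * on_chip_share / 1000


if __name__ == "__main__":
    print(f"{links_per_port()} links/port, {total_links()} links, "
          f"{data_pins()} data pins, {io_power_watts():.1f} W for I/O")
```

If the full 200 mW per link were charged to the switch chip, the same
calculation would give about 102 W, which only strengthens the point
that I/O power alone is a first-order design problem.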

Further down the Road 

WDM technology is significantly increasing the number of channels to
be switched for the same number of fibers. This means that the pre-WDM
port numbers of 8-32 are growing to hundreds, if not thousands, for
the core switching nodes. Whereas advances in CMOS technology and
single-stage expansion initially allowed the number of ports to grow
by factors of 2-4 with a moderate number of chips, hundreds or
thousands of ports call for modular multistage interconnection fabrics
(Fig. 2d). Terabit-per-second switches and routers employing a
multistage fabric will require hundreds of switching chips. High-speed
interconnection of chips, boards, and racks, and the associated power
and space issues will be among the most challenging problems in the
design of such systems.

The common belief is that optical technologies will eventually solve
most of these problems. Parallel optical interconnects between the
stages of multistage switch fabrics are already employed in the most
advanced terabit systems. These are adequate to meet the more
demanding distance and speed requirements of these large-scale systems
without electromagnetic interference (EMI) sensitivity, but they are
not yet dense, fast, and cheap enough for the future. The speed per
interconnection channel will need to be pushed toward 10 Gb/s. Cost
reductions must come from technologies such as improved VCSEL array
and VCSEL packaging technology, molded plastic connectors, low-cost
optical waveguides embedded in boards, and passive alignment
techniques. Ultimately, low-cost (coarse) WDM technology may be used
inside boxes to reduce the degree of physical parallelism. Waveguides
could be used to realize the necessary multiplexing and demultiplexing
functions.

Although it is already clear that optical switches will be deployed in
the cross-connect network layer, it is less obvious whether optical
switching technologies could partially replace electronic switches in
multistage packet switch routers. In particular, once the interstage
connections are optical, it would be attractive to realize the center
stage(s) of a multistage network by optical switch modules, thereby
saving part of the costly optoelectronic and electro-optical
conversions of the interconnection and reducing overall power. Since
optical switch technologies are too slow to switch on a
packet-by-packet basis, this constraint must be compensated for by
suitable electronic packet buffering and switching schemes such that
longer bursts of multiple packets can be handled in one batch by the
optical switch. This resembles concepts invented more than a decade
ago, such as fast circuit switching or burst switching. Large-scale
switches or routers will continue to be built as hybrid systems
requiring the flexibility of electronic network processors and
electronic packet switching modules surrounding a potentially optical
switching stage. In the absence of optical memory and logic of
sufficient capacity, hybrid switches will be with us for a long time
to come.

Concluding Remarks 

Bringing the forwarding capabilities of future switches and routers to
the level dictated by exploding Internet traffic on one hand and
rapidly expanding fiber transmission technology on the other is a task
that will challenge the imagination and technical skills of the
communications engineering community in the new century. As explained
in this article, there are hardware architectures and designs that
exploit advances in VLSI and optical technologies, and promise to
scale to the necessary data rates and system sizes. These designs need
to be complemented by equally scalable and reliable networking
software. Tomorrow's NP-based routers and switches will include an
amazing number and variety of processors, from picoprocessors and RISC
processors embedded within NPs to one or multiple system processors
for node and network control. A system and software structure that
optimally distributes both packet-by-packet and control processing
tasks among these different processors will be crucial for tomorrow's
network nodes. The definition of open NP application programming
interfaces such as the CPIX standard is a must for the industry to
successfully address this challenge.